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Abstract 



H, 

The quadruple (1484 801,1203120,1169 407,1157 520) already known 
. is essentially the only non-trivial solution of the Diophantine equation 

I + 2y'^ = + Aw'^ for \y\, \z\, and \w\ up to one hundred million. 

We describe the algorithm we used in order to establish this result, thereby 
explaining a number of improvements to our original approach [EJj . 

> \ 

^ ; 1 Introduction 
^ ■ 

^ _ 1.1. In [EJ], we described a systematic method to search efficiently for all 

O ' solutions of a Diophantine equation of the form 

o ■ 

f{xi, ...,Xn)= g{yi, ■■■,ym) 



^ I which are contained within the (n + m)-dimensional cube 

P^. ; {(xi, . . . , 1/1, . . . , y^) G I \xl \y,\ < B}. 

The expected running-time of this algorithm is 0{B^^^^'^'"^^). 

1.2. The basic idea is as follows. 

Algorithm H. 

i) Evaluate / on all points of the n- dimensional cube {(xi, . . . , Xn) E \ \xi\ < B}. 
Store the values within a set L. 
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ii) Evaluate g on all points of the cube {{yi, ■ ■ ■ ,ym) G Z™" | \yi\ < B} of dimen- 
sion m. For each value start a search in order to find out whether it occurs in L. 
When a coincidence is detected, reconstruct the corresponding values of xi, . . . ,Xn 
and output the solution. 

1.3. Remarks. 



a) In fact, we are interested in the very particular Diophantine equation 
+ 2?/^ = + 4w^ which was suggested by Sir Peter Swinnerton-Dyer. It is 
unknown whether this equation admits finitely or infinitely many primitive solu- 
tions. If their number were actually finite then this would settle a famous open 
problem in the arithmetic of K3 surfaces |PTt Problem/Question 6. a)]. 

b.i) In the form stated above, the main disadvantage of Algorithm H is that it 
requires an enormous amount of memory. Actually, the set L is too big to be stored 
in the main memory even of our biggest computers, already when the value of B is 
only moderately large. 

For that reason, we introduced the idea of paging. We choose a page prime Pp and 
work with the sets Lr := {s E L \ s = r (mod Pp)} for r = 0, . . . ,Pp — 1, separately. 
At the cost of some more time spent on initializations, this yields a reduction of the 
memory space required by a factor of ^. 

ii) The sets were implemented in the form of a hash table with open addressing. 

iii) It is possible to achieve a further reduction of the running-time and the memory 
required by making use of some obvious congruence conditions modulo 2 and 5. 

c) Precisely ten primitive solutions of the Diophantine equation + 2y'^ = + Aw"^ 
are known up to now. Among them, there are the two obvious ones (±1:0: ±1:0). 

Furthermore, by an implementation of Algorithm H, the non-obvious solutions 
(±1 484 801 : ±1 203 120 : ±1 169 407 : ±1 157 520) were found. We searched through 
the hypercube {{x,y,z,w) G Z"^ | jti?! < 2.5- 10®}. Details are given 

in lETl. 



1.4. The goal of this note is to describe an improved implementation of Algo- 

rithm H which we used in order to find all solutions of x'^ + 2y'^ = z'^ + Aw'^ contained 
within the hypercube {(x, y, z, w) E 'Z'^ \ \x\, \y\,\z\,\w\ < 10^}. 

Unfortunately, our result is not very spectacular. There is no new primitive solution. 



2 More Congruences 



2.0.1. The most obvious way to further reduce the size of the sets and 

to increase the speed of Algorithm H is to find further congruence conditions for 
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solutions and evaluate / and g only on points satisfying these conditions. As the 
equation, we are interested in, is homogeneous, it is sufficient to restrict considera- 
tion to primitive solutions. 

2.0.2. It should be noticed, however, that this idea is subject to strict limi- 

tations. If we were using the most naive 0(i?"'"'"™')-algorithm then, for more or less 
every / G N, the congruence /(xi, . . . , x„) = g{yi, . . . , ym) (mod I) caused a reduc- 
tion of the number of {n + m)-tuples to be checked. For Algorithm H, however, the 
situation is by far less fortunate. 

One may gain something only if there are residue classes (r mod /) which are rep- 
resented by /, but not by g, or vice versa. Values, the residue class of which is not 
represented hj g, do not need to be stored into L^. Values, the residue class of which 
is not represented by /, do not need to be searched for. 

Unfortunately, if / is prime and not very small then the Weil conjectures ensure 
that all residue classes modulo I are represented by both / and g. In this case, 
the idea fails completely. The same is, however, not true for prime powers / = p^. 
Hensel's Lemma does not work when all partial derivatives 4^(xi, . . . , respec- 
tively Q^{yi, . . . , ym), are divisible by p. This makes it possible that certain residue 
classes (r mod p^) are not representable although (r mod p) is. 

2.1 The prime 5. Congruences modulo 625 

2.1.1. In |EJ] . we made use of the fact that y is always divisible by 5. How- 
ever, at this point, one can do a lot better. When one takes into consideration that 

= 1 (mod 5) for every a G Z not divisible by 5, a systematic inspection shows 
that there are actually two cases. 

Either, 5\w. Then, 5\x and 5\z. Or, otherwise, 5\x. Then, 5\z and 5\w. Note that, 
in the latter case, one indeed has + 4w^ = 1 + 4 = (mod 5). 

2.1.2. The Case 5\w. We call this case "N" and use the letter N at a 

prominent position in the naming of the relevant ffies of source code. N stands 
for "normal" . To consider this the ordinary one is justified by the fact that 

all primitive solutions known actually belong to it. Note, however, that we have no 
theoretical reason to believe that this case should in whatever sense be better than 
the other one. 

In case N, we rearrange the equation to /Ar(a:, z) = gN^y, w) where 

/Ar(x, z) := - / and gN{y, w) := Aw'^ - 2y^. 

As y and w are both divisible by 5, we get gN{y,w) = Aw^ — 2y^ = (mod 625). 
Consequently, f^ix^z) = (mod 625). 
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This yields an enormous reduction of the set L^. To see this, recall 5\x and 5\z. 
That means, for x, there are precisely (p{625) possibilities in Z/625Z. Further, for 

each such value, the congruence z'^ = (mod 625) may not have more than four so- 
lutions. All in all, there are 4 • (^(625) = 2 000 possible pairs {x,z) e {Z/625Zf. 

Further, these pairs are very easy to find, computationally. The fourth roots of unity 
modulo 625 are ±1 and ±182. For each x e Z/625Z*, put z := {±x mod 625) and 
z := (±182a; mod 625). 

We store the values of /at into the set L^. Only 2 000 out of 625^ values (0.512%) 
need to be computed and stored. Then, each value of is looked up in L^. Here, as 
y and w are both divisible by 5, only one value out of 25 (4%) needs to be computed 
and searched for. 

2.1.3. The Case 5\x. We call this case "S" and use the letter S at a promi- 

nent position in the naming of the relevant files of source code. S stands for "Son- 
derfall" which means "exceptional case". It is not known whether there exists a 
solution belonging to case S. 

Here, we simply interchange both sides of the equation. Define 

fs{z, w) := z'^ + Aw'^ and gs{x, y) := x'^ + 2y^. 

As X and y are divisible by 5, we get x'^ + 2y^ = (mod 625) and, therefore, 
z^ + Aw"^ = (mod 625). 

Again, this congruence allows only 4 • (p{625) — 2 000 solutions {z,w) e (Z/625Z)^ 
and these pairs are easily computable, too. The fourth roots of (—4) in Z/625Z are 
±181 and ±183. For each x G Z/625Z*, one has to consider z :— (±181x mod 625) 

and z := (±183a; mod 625). 

We store the values of fs into the set L^. Then, we search through Lj. for the values 
of gs- As above, only 2 000 out of 625^ values need to be computed and stored and 
one value out of 25 needs to be computed and searched for. 

2.2 The prime 2 

2.2.1. Any primitive solution is of the form that x and z are odd while y 

and w are even. 

2.2.2. In case S, there is no way to do better than that as both fs and gs 

represent (r mod 2'') for A; > 4 if and only if r = 1 (mod 16). 

In case N, the situation is somewhat better. gN{y, w) — Aw^ — 2y^ is always divisible 
by 32 while fj^{x,z) — x^ — z^ = Q (mod 32), as may be seen by inspecting the 
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fourth roots of unity modulo 32, implies the condition x = ±2; (mod 8). This may 
be used to halve the size of L^. 



2.3 The prime 3 

2.3.1. Looking for further congruence conditions, a primitive solution must 

necessarily satisfy, we did not find any reason to distinguish more cases. But there 
are a few more congruences which we used in order to reduce the size of the sets L^. 

To explain them, let us first note two theorems on binary quadratic forms. They may 
both be easily deduced from |HWt Theorems 246 and 247]. 

2.3.2. Theorem. The quadratic forms qi{a,b) := a^+b'^ , q2{a,b) := a'^—2b'^ , 

and qsla, b) := + 26^ admit the property below. 

Suppose no := gj(ao,&o) is divisible by a prime p which is not represented by qi. 
Then, p|ao and p\bQ. 

2.3.3. Theorem. A prime number p is represented by qi, q2, or q^, respec- 
tively, if and only if (0 mod p) is represented in a non-trivial way. In particular, 

i) p is represented by qi if and only if p = 2 or p = 1 (mod 4) . 

ii) p is represented by q2 if and only if p = 2 or (^) = 1. The latter means 
p = 1,7 (mod 8). 

iii) p is represented by q^ if and only if p = 2 or {^) = 1. The latter is equivalent 
to p = 1,3 (mod 8). 

2.3.4. Remark. There is the obvious asymptotic estimate 

^{qi{a, b) \a,beZ, qi{a, b) G P, qi{a, b) < n} ^ — . 

21ogn 

Further, 

n 



tl{gi(a,6) \ a,be'Z, \qi{a,b)\ < n} ~ Cj 



Vlogn 



where Ci, C2, and C3 are constants which can be expressed explicitly by Eu- 
ler products. (For qi, this is worked out in [Bri Satz (1.8.2)]. For the other forms, 
J. Briidern's argument works in the same way without essential changes.) 

2.3.5. Congruences modulo 81. 



In case N, gN{y, w) = {2w'^Y — 2(y^)^ = g2(2w^, y"^) where q2 does not represent the 
prime 3. Therefore, if 3|5'Ar(?/,w) then 3|2w^ and 3|?/^ which implies y and w are 
both divisible by 3. By consequence, if ?>\gN{y,w) then, automatically, Sl\gN{y,w). 
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If 3\fN{x,z) but 81\fN{x,z) then f]\f{x,z) does not need to be stored into L^. Fur- 
ther, if 3|x and 3\z then f]^{x,z) does not need to be stored, either, as it can- 
not lead to a primitive solution. This reduces the size of the set Ly. by a factor 
of| + 4-|(|-^) = l-53.9%. 

In case S, the situation is the other way round. fs{,z, w) = (z^)^+(2w^)^ = qi{z'^, 2w^) 
and qi does not represent the prime 3. Therefore, if 3\fs{z,w) then 3|2^ and 3|2t/;^ 
which implies that z and w are both divisible by 3 and 811/5(2;, w). 

We use this in order to reduce the time spent on reading. If 3\gs{x, y) but Sl\gs{x, y) 
or if 3|a; and 3\y then gs{x,y) does not need to be searched for. Although modular 
operations are not at all fast, the reduction of the number of attempts to read 
by 53.9% is highly noticeable. 



2.4 Some more hypothetical improvements 



2.4.1. 

i) In the argument for case N given above, p = 3 might be replaced by any other 
prime p = 3, 5 (mod 8). 

In case S, the same argument as above works for every prime p = 3 (mod 8). 
For primes p = 5 (mod 8), the strategy could be reversed, is a binary quadratic 
form which represents (0 mod p) only in the trivial manner. Therefore, if p\gs{x,y) 
then p\x and p\y. It is unnecessary to store fs{z,w) if p\z and p\w or if ^1/5(2;,^) 
but p^]fs{z,w). 

i') Each argument mentioned may be extended to some primes p = 1 (mod 8). 
For example, in case N, what is actually needed is that 2 is not a fourth power 
modulo p. This is true, e.g., for p = 17, 41, and 97, but not for p = 73 and 89. 

ii) fisf and fs do not represent the residue classes of 6, 7, 10, and 11 modulo 17. 
gN and {—gs) do not represent 1,3, and 9 modulo 13. This could be used to reduce 
the load for writing as well as reading. 

2.4.2. Remarks. 



a) We did not implement these improvements as it seems the gains would be 
marginal or the cost of additional computations would even dominate the effect. It is, 
however, foreseeable that these congruences will eventually become valuable when 
the speed of the CPU's available will continue to grow faster than the speed of mem- 
ory. Observe that alone the congruences noticed in a) could reduce the amount of 
data to be stored into L to a size asymptotically less than eB"^ for any e > 0. 

b) For every prime p different from 2, 5, 13, and 17, the quartic forms /at, gN, fs, 



and gs represent all residue classes modulo p. This means, ii) may not be carried 
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over to any further primes. 

This can be seen as follows. Let b be equal to /at, fs, Qn, or gs- (0 mod p) is 
represented by b, trivially. Otherwise, b{x,y) = r defines an affine curve Cr of 
genus three with at most four points on the infinite line. The Weil conjectures [Wej 
Corollaire 3 du Theoreme 13] imply that [{p+ 1 — Q^/p) —4] is a lower bound for the 
number of Fp-rational points on Cr- This is a positive number as soon as p > 43. 
In this case, every residue class (r mod p) is represented, at least, once. 

For the remaining primes up to p = 41, an experiment shows that all residue classes 
modulo p are represented by /at, fs, Qn, as well as gs- 

3 A 64 bit based implementation of the algorithm 

3.1. We migrated the implementation of Algorithm H from a 32 bit processor 

to a 64 bit processor. This means, the new hardware supports addition and mul- 
tiplication of 64 bit integers. Even more, every operation on (unsigned) integers is 
automatically modulo 2^^. 

From this, various optimizations of the implementation described in |EJ] are al- 
most compelling. The basic idea is that 64 bits should be enough to define hash 
value and control value, two integers significantly less than 2^^ which should be 
independent on each other, by selection of bits instead of using (notoriously slow) 
modular operations. 

Note, however, that the congruence conditions modulo 2 imposed imply that 
x'^ = z'^ = 1 (mod 16) and 2y^ = 4w^ = (mod 16). This means, the four least 
significant bits of / and g may not be used as they are always the same. 

3.2. The description of the algorithm below is based on case S, case N being 

completely analogous. 

Algorithm H64. 

I. Initialization. Fix B := 10^. Initialize a hash table of 2^'^ = 134 217 728 integers, 
each being 32 bit long. Fix the page prime Pp := 200 003. 

Further, define two functions, the hash function h and the control function c, which 
map 64 bit integers to 27 bit integers and 31 bit integers, respectively, by selecting 
certain bits. Do not use any of the bits twice to ensure h and c are independent on 
each other and do not use the four least significant bits. 

II. Loop. Let r run from to Pp — 1 and execute steps \K] and IBTI for each r. 

A. Writing. Build up the hash table, which is meant to encode the set Lr, as follows. 
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a) Find all pairs {z, w) of non-negative integers less than or equal to B which satisfy 
+ = r (mod Pp) and all the congruence-conditions for primitive solutions, 
listed above. (Make systematic use of the Chinese remainder theorem.) 



b) Execute steps [i)J and ii) below for each such pair. 

i) Evaluate fs{z, w) := (z^ + mod 2^^). 

ii) Use the hash value h{fs{z,w)) and linear probing to find a free place in the 
hash table and store the control value c{fs{z,w)) there. 

B. Reading. Search within the hash table, as follows. 

a) Find all pairs (x, y) of non-negative integers less than or equal to B which satisfy 
+ = r (mod Pp) and all the congruence conditions for primitive solutions, 
listed above. (Make systematic use of the Chinese remainder-theorem.) 



b) Execute steps and ii) below for each such pair. 



i) Evaluate gs{x,y) := (x^ + 2y^ mod 2^'^) on all points found in step 

ii) Search for the control value c{gs{x, y)) in the hash table, starting at the hash value 
h{gs{x,y)) and using linear probing, until a free position is found. Report all hits 
and the corresponding values of x and y. 

3.3. Remarks. (Some details of the implementation). 



i) The fourth powers and fourth roots modulo Pp are computed during the initial- 
ization part of the program and stored into arrays because arithmetic modulo Pp is 
slower than memory access. 

ii) The control value is limited to 31 bits as it is implemented as a signed integer. 
We use the value (—1) as a marker for an unoccupied place in the hash table. 

iii) In contrast to our previous programs |EJj . we do not precompute large tables 
of fourth powers modulo 2^"^ because an access to these tables is slower than the 
execution of two multiplications in a row (at least on our computer). 

iv) It is the impact of the congruences modulo 625, 8, and 81, described above, 
that the set of pairs {y, w) [{x, y)] to be read is significantly bigger than the 
set of pairs (x, z) [{z, w)] to be written. They differ actually by a factor of 
ffSs • ffi ■ 2 ~ 33.901 in case N and ^ffts ■ M| ~ 3.601 in case S. 

As a consequence of this, only a small part of the running-time is spent on writing. 
The lion's share is spent on unsuccessful searches within L. 

3.4. Remarks (Post-Processing). 

i) Most of the hits found in the hash table actually do not correspond to solutions of 
the Diophantine equation. Hits indicate only a similarity of bit-patterns. Thus, for 
each pair of x and y reported, one needs to check whether a suitable pair of z and w 
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does exist. We do this by recomputing 2;^ + 4^^ for all z and w which fulfill the 
given congruence conditions modulo Pp and powers of the small primes. 
Although this method is entirely primitive, only about 3% of the total running-time 
is actually spent on post-processing. One reason for this is that post-processing is 
not called very often, on average only once on about five pages. For those pages, 
the writing part of the algorithm needs to be recapitulated. This is, however, not 
time-critical as only a small part of the running-time is spent on writing, anyway, 
ii) An interesting alternative for post-processing would be to apply the theory of 
binary quadratic forms. The obvious strategy is to factorize -|- completely 
into prime powers and to deduce from the decomposition all pairs (a, 6) such that 
_|_ j,2 _ _|_ ^jjgjj^ Qjig jna^y check whether for one of them both a and | are 
perfect squares. 

3.5. Remcirk. The migration to a more bit-based implementation led to an 

increase of the speed of our programs by a factor of approximately 1.35. 

4 Adaption to the memory architecture of our 
computer — generahties 

4.0.1. The factor of 1.35 is less than what we actually hoped for. For that 

reason, we made various tests in order to find out what the limiting bottleneck of 
our program is. It turned out that the major slowdown is the access of the processor 
to main memory. 

Our programs are, in fact, doing only two things, integer arithmetic and memory ac- 
cess. The integer execution units of modern processors are highly optimized circuits 
and several of them work in parallel inside one processor. They work a lot faster 
than main memory does. In order to reach a further improvement, it will therefore 
be necessary to take the architecture of memory into closer consideration. 

4.1 The memory architecture 

4.1.1. The Situation. Computer designers try to bridge the gap between 

the fast processor and the slow memory by building a memory hierarchy which 
consists of several cache levels. 

The cache is a very small and fast memory inside the processor. The first cache level, 
called LI cache, of our processor consists of a data cache and an instruction cache. 
Both are 64 kByte in size. The cache manager stores the most recently used data 
into the cache in order to make sure a second access to them will be fast. 
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If the cache manager does not find necessary data within the LI cache then the pro- 
cessor is forced to wait. In order to dehver data, the cache management first checks 
the L2 cache which is 1024 kByte large. It consists of 16384 fines of 64 Byte, each. 

4.1.2. Our Program. Our program fits into the instruction cache, com- 
pletely. Therefore, no problem should arise from this. 

When we consider the data cache, however, the situation is entirely different. 
The cache manager stores the 1024 most recently used memory fines, each being 
64 Byte long, within the LI data cache. 

This strategy is for sure good for many applications. It guarantees main memory 
may be scanned at a high speed. On the other hand, for our application, it fails com- 
pletely. The reason is that access to our 500 MByte hash table is completely random. 
An access directly to the LI cache happens in by far less than 0.1% of the cases. 
In all other cases, the processor has to wait. 

Even worse, it is clear that in most cases we do not even access the L2 cache. 
This means, the cache manager needs to access main memory in order to transfer 
the corresponding memory line of 64 Byte into the LI cache. After this, the processor 
may use the data. In the case that there is no free line available within the LI cache, 
the cache manager must restore old data back to main memory, first. This process 
takes us 60 nanoseconds, at least, which seems to be short, but the processor could 
execute more than 100 integer instructions during the same time. 

The philosophy for further optimization must, therefore, be to adapt the programs 
as much as possible to our hardware, first of all to the sizes of the LI and L2 caches. 

4.1.3. Programmer's position. Unfortunately, the whole memory hierar- 
chy is invisible from the point of view of a higher programming language, such as C, 
since such languages are designed for being machine-independent. Further, the hard- 
ware executes the cache management in an automatic manner. This means, even by 
programming in assembly, one cannot control the cache completely although some 
new assembly instructions such as prefetch allow certain direct manipulations. 

4.1.4. A way out. A practical way, nonetheless to gain some infiuence on 

the memory hierarchy, is to rearrange the algorithm in an apparently nonsensical 
manner, thereby making memory access less chaotic. One may then hope that the 
automatic management of the cache, when confronted with the modified algorithm, 
is able to react more properly. This should allow the program to run faster. 

4.2 Our first trial 

4.2.1. Our first idea for this was to work with two arrays instead of one. 
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Algorithm M. 

i) Store the values of / into an array and the values of g into a another one. 
Write successively calculated values into successive positions. It is clear that this 
part of the algorithm is not troublesome as it involves a linear memory access which 
is perfectly supported by the memory management. 

ii) Then, use Quicksort in order to sort both arrays. In addition to being fast, 
Quicksort is known to have a good memory locality when large arrays are sorted. 

iii) In a final step, search for matches by going linearly through both arrays as 
in Mergesort. 

4.2.2. Remcirk. Unfortunately, the idea behind Algorithm M is too simple 

to give it any chance of being superior to the previous algorithms. However, it is a 
worthwhile experiment. Indeed, our implementation of Algorithm M causes at least 
30 times more memory transfer compared with the previous programs but, actually, 
it is only three times slower. This indicates that our approach is reasonable. 

5 Hashing with partial presorting 
5.1 The algorithm 

5.1.1. Our final algorithm is a combination of sorting and hashing. An im- 
portant aspect of it is that the sorting step has to be considerably faster than the 
Quicksort algorithm. For that reason, we adopted some ideas from linear-time sort- 
ing algorithms such as Radix Sort or Bucket Sort. 

5.1.2. The algorithm works as follows. Again, the description is based on 

case S, case N being analogous. 

Algorithm H64B. 

I. Initialization. Fix B := 10®. Initialize a hash table H of 2^^ = 134 217728 inte- 
gers, each being 32 bit long. Fix the page prime Pp :— 200 003. 
In addition, initialize 1024 auxiliary arrays Ai each of which may contain 
2^^ = 131 072 long (64 bit) integers. 

Further, define two functions, the hash function h and the control function c, which 
map 64 bit integers to 27 bit integers and 31 bit integers, respectively, by selecting 
certain bits. Do not use any of the bits twice to ensure h and c are independent on 
each other and do not use the four least significant bits. 

Finally, let h^^°^ denote the function mapping 64 bit integers to integers 
within [0, 1023] which is given by the ten most significant bits of h. In other words. 
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for every x, h^^^\x) is the same as h{x) shifted to the right by 17 bits. 

II. Outer Loop. Let r run from to — 1 and execute \K] and [Rl for each r. 

A. Writing. Build up the hash table, which is meant to encode the set Lr, as follows. 

a) Preparation. Find all pairs {z, w) of non-negative integers less than or equal to B 

which satisfy 2;^ + = r (mod pp) and all the congruence-conditions for primitive 

solutions, listed above. (Make systematic use of the Chinese remainder theorem.) 



b) Inner Loop. Execute steps [i)J- iii) below for each such pair. 

i) Evaluate fs{z,w) := {z^ + Aw^ mod 2^4). 

ii) Do not store fs{z,w) into the hash table, immediately. Put i := h^^'^\fs{z,w)), 
first. 

iii) Add fs{z,w) to the auxiliary array Ai. Maintain Ai as an unordered list, i.e. al- 
ways write to the lowest unoccupied address. 

If there is no space left in Ai then output an error message and abort the algorithm. 

c) Storing. Let i run from to 1023. For each i let j run through the addresses 
occupied in Ai. 

For fixed i and j, extract from the 64 bit integer Ai[j] the 27 bit hash value h{Ai[j]) 
and the 31 bit control value c(y4j[j]). 

Use the hash- value /i(ylj[j]) and linear probing to find a free place in the hash table 
and store the control- value c(y4i[j]) there. 

d) Clearing up. Clear the auxiliary arrays Ai for all i G [0, 1023] to make them 
available for reuse. 

B. Reading. Search within the hash table, as follows. 

a) Preparation. Find all pairs (x, y) of non-negative integers less than or equal to B 
which satisfy + 2y^ = r (mod Pp) and all the congruence conditions for primitive 
solutions, listed above. (Make systematic use of the Chinese remainder-theorem.) 



b) Inner Loop. Execute steps [i)]- iii) below for each such pair. 

i) Evaluate gs{x, y) := {x^ + 2y^ mod 2^^). 

ii) Do not look up gsix,y) in the hash table, immediately. Put i := h^^'^\gs{x,y)), 
first. 

iii) Add gs{x,y) to the auxiliary array A^. Maintain Ai as an unordered list, i.e. al- 
ways write to the lowest unoccupied address. 

If there is no space left in Ai then call d[i] and add gs{x,y) to Ai, afterwards. 

c) Searching. Clearing all buffers. Let i run from to 1023. For each z, call d[i]. 
When this is finished, terminate the algorithm. 

Subroutine d[i]) Clearing a buffer. Let j run through the addresses occupied in Ai. 
For fixed j, search for the control value c(y4j[j]) within the hash table H , starting 
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at the hash value h{Ai[j]) and using hnear probing, until a free place is found. 
Report all hits and the corresponding values of x and y. 
Having done this, declare Ai to be empty. 

5.1.3. Remark. The auxiliary arrays play the role of a buffer. Thus, one 

could say that we introduced some buffering into the management of the hash ta- 
ble H. However, this description misses the point. 

What is more important is that the values of fs to be stored into are partially 
sorted according to the 10 most significant bits of h{fs{z, w)) by putting them into 
the auxiliary arrays A^. When the hash table is then built up, the records arrive 
almost in order. The same is true for reading. 

What we actually did is, therefore, to introduce some partial presorting into the 
management of the hash table. 

5.1.4. Remark. It is our experience that each auxiliary array carries more 



or less the same load. In particular, in step IIIJR.b .iii) when the buffers are filled 
up for writing, a buffer overflow should never occur. For this reason, we feel free to 
treat this possibility fatal error. 

5.2 Running-Time 

5.2.1. Algorithm H64B uses about three times more memory than our pre- 
vious algorithms but our implementation runs almost three times as fast. It was 
this factor which made it possible to attack the bound B = 10^ in a reasonable 
amount of time. 

The final version of our programs took almost exactly 100 days of CPU time on an 
AMD Opteron 248 processor. This time is composed almost equally of 50 days for 
case N and 50 days for case S. The main computation was executed in parallel on 
two machines in February and March, 2005. 

5.2.2. Why is this algorithm faster? To answer this question, one has 

to look at the impact of the cache. For the old program, the cache memory was 
mostly useless. For the new program, the situation is completely different. 



When the auxiliary arrays are filled in step IIIJRTb .ii) and IIIJBTb .ii) access to these 
arrays is linear. There are only 1024 of them which is exactly the number of lines 
in the LI cache. When an access does not hit into that innermost cache then the 
corresponding memory line is moved to it and the next seven accesses to the same 
auxiliary array are accesses to that line. Altogether, seven of eight memory accesses 
hit into the LI cache. 
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When an auxiliary array is emptied in step IllJlV.b xl) or IIIJB.b .d[i]). the situation 
is similar. There are a high number of accesses to a very short segment of the 
hash table. This segment fits completely into the L2 cache. It has to be moved into 
that cache, once. Then, it can be used many times. Again, access to the auxiliary 
array is linear and a hit into the LI cache occurs in seven of eight cases. 

All in all, for Algorithm H64B, most memory accesses are hits into the cache. 
This means, at the cost of some more data transfer altogether, we achieved that 
main memory may be mostly used at the speed of the cache. 
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